
feat: add dynamic shapes kernel specialization strategy for TRT-RTX #4184

Merged
lanluo-nvidia merged 2 commits into pytorch:main from tp5uiuc:feat/trtrtx-dynamic-shapes-strategy on Apr 21, 2026

Conversation

tp5uiuc (Contributor) commented Apr 12, 2026

Description

Expose IRuntimeConfig.setDynamicShapesKernelSpecializationStrategy() through the Torch-TensorRT Python API for TensorRT-RTX builds.

Users can now control how shape-specialized kernels are compiled at runtime for dynamic shapes via the new dynamic_shapes_kernel_specialization_strategy compilation setting:

  • "lazy" (default): Compile shape-specialized kernels in the background, use fallback until ready
  • "eager": Compile immediately (blocking)
  • "none": Always use fallback kernels, never specialize

Depends on: #4180 (runtime cache API — provides the IRuntimeConfig infrastructure)
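
A minimal usage sketch, assuming a TensorRT-RTX build of Torch-TensorRT in which this setting is available (the model, shapes, and strategy value below are purely illustrative):

```python
import torch
import torch_tensorrt as torchtrt

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval().cuda()

# Dynamic batch dimension: the engine must handle any shape between min and max.
inputs = [
    torchtrt.Input(
        min_shape=(1, 3, 224, 224),
        opt_shape=(8, 3, 224, 224),
        max_shape=(16, 3, 224, 224),
        dtype=torch.float32,
    )
]

# "lazy" (default) compiles specialized kernels in the background,
# "eager" compiles them up front (blocking), "none" always uses fallback kernels.
trt_mod = torchtrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    dynamic_shapes_kernel_specialization_strategy="lazy",
)

out = trt_mod(torch.randn(4, 3, 224, 224, device="cuda"))
```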

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

meta-cla bot added the cla signed label Apr 12, 2026
github-actions bot added labels Apr 12, 2026: documentation, component: tests, component: conversion, component: core, component: build system, component: api [Python], component: runtime, component: dynamo
github-actions bot requested a review from cehongwang April 12, 2026 20:48
Comment thread: tests/py/dynamo/runtime/test_001_dynamic_shapes_kernel_strategy.py
github-actions bot requested a review from zewenli98 April 14, 2026 17:44
tp5uiuc force-pushed the feat/trtrtx-dynamic-shapes-strategy branch from c222c72 to 385eec6 on April 15, 2026 18:54
tp5uiuc and others added 2 commits April 20, 2026 08:58
Expose IRuntimeConfig.setDynamicShapesKernelSpecializationStrategy()
through the Torch-TensorRT Python API. Users can now control how
shape-specialized kernels are compiled at runtime for dynamic shapes
on TensorRT-RTX via the new `dynamic_shapes_kernel_specialization_strategy`
compilation setting ("lazy", "eager", or "none").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review feedback: compile with torchtrt.Input min/opt/max
ranges so dynamic shapes are actually exercised.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tp5uiuc force-pushed the feat/trtrtx-dynamic-shapes-strategy branch from 385eec6 to d7619ca on April 20, 2026 15:58
tp5uiuc marked this pull request as ready for review April 20, 2026 16:09
lanluo-nvidia (Collaborator) left a comment

lgtm, one minor comment.

hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation. Not used for TensorRT-RTX.
runtime_cache_path (str): Path to the runtime cache for TensorRT-RTX JIT compilation results. Not used for standard TensorRT.
dynamic_shapes_kernel_specialization_strategy (str): Strategy for dynamic shape kernel specialization at runtime (TensorRT-RTX only). Options: "lazy", "eager", "none". Default: "lazy".
Collaborator

Can we add a warning or check in case a user configures dynamic_shapes_kernel_specialization_strategy on standard TensorRT?

Contributor Author

This is a good suggestion, Lan. I have a follow-up task to emit user warnings for:

  1. timing cache used in TRT-RTX
  2. runtime cache used in standard TRT
  3. dynamic shape strategy used in standard TRT
  4. cudagraphs flag used in standard TRT

I will put the warnings in that follow-up so that it's easier to review the change/behavior. A rough sketch of what such a check could look like is shown below.
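
A minimal sketch of what such a guard could look like, assuming the ENABLED_FEATURES.tensorrt_rtx flag referenced later in this thread; the helper name and call site are hypothetical and may differ from the eventual follow-up:

```python
import warnings

from torch_tensorrt._features import ENABLED_FEATURES


def _warn_rtx_only_settings(settings) -> None:
    # Hypothetical helper: warn when an RTX-only knob is set on a standard
    # TensorRT build, where it would otherwise be silently ignored.
    if ENABLED_FEATURES.tensorrt_rtx:
        return
    if settings.dynamic_shapes_kernel_specialization_strategy != "lazy":
        warnings.warn(
            "dynamic_shapes_kernel_specialization_strategy is only honored by "
            "TensorRT-RTX builds and will be ignored by standard TensorRT."
        )
```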

lanluo-nvidia merged commit 8903707 into pytorch:main Apr 21, 2026
84 checks passed
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request Apr 22, 2026
Address the structural PR feedback by extracting TensorRT-RTX-specific
IRuntimeConfig state into its own type and collapsing the per-feature
appliers that previously scattered `#ifdef TRT_MAJOR_RTX` through
TRTEngine.

What
 - New core/runtime/TRTRuntimeConfig.{h,cpp} owns the IRuntimeConfig
   shared_ptr plus (on TRT-RTX) the IRuntimeCache, runtime-cache path,
   dynamic shapes kernel strategy, CUDA graph strategy, and the
   rtx_native_cudagraphs_disabled one-shot flag. All per-feature
   appliers live there as public members and are no-ops on non-RTX
   builds, keeping the only `#ifdef TRT_MAJOR_RTX` scatter contained
   in this new file.
 - Strategy fields are now strongly-typed enums
   (`DynamicShapesKernelStrategy`, `CudaGraphStrategyOption`) with
   matching `to_string`/`to_int` helpers, validated at engine
   construction via `to_dynamic_shapes_kernel_strategy` /
   `to_cuda_graph_strategy_option` rather than raw int ranges.
 - `TRTEngine::recreate_execution_context` is now backend-agnostic:
   it calls `runtime_cfg.ensure_initialized`, applies the allocation
   strategy, and creates the execution context via
   `createExecutionContext(IRuntimeConfig*)`. Both standard TensorRT
   and TRT-RTX go through this uniform path; only the three RTX-only
   setters (`setRuntimeCache`, `setDynamicShapesKernelSpecializationStrategy`,
   `setCudaGraphStrategy`) stay behind an
   `#ifdef TRT_MAJOR_RTX` guard inside the struct.
 - `~TRTEngine` now wraps cleanup in try/catch and delegates cache
   persistence to `TRTRuntimeConfig::save_runtime_cache_nothrow`, so
   stack unwinding can no longer propagate a cache-save failure out
   of the destructor.
 - `save_runtime_cache_nothrow` uses `std::filesystem` + atomic
   `tmp+rename` only; file locking is out of scope for this PR and
   will be introduced in a follow-up once we pick a portable
   mechanism.
 - `is_monolithic_capturable` asserts `exec_ctx` is non-null; the
   three RTX-only appliers `TORCHTRT_ASSERT` that `config` is live
   before dereferencing.
 - `disable_rtx_native_cudagraphs` persists the runtime cache before
   flipping the strategy so any kernels compiled under the internal
   capture survive to the next reload.
 - `TRTEngine::to_str` now emits human-readable strategy names (via
   `to_string(enum)`) instead of integer codes.
 - New serialization indices (`RUNTIME_CACHE_PATH_IDX`,
   `DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX`, `CUDA_GRAPH_STRATEGY_IDX`) are now
   `#ifdef TRT_MAJOR_RTX`-gated in runtime.h, register_jit_hooks.cpp,
   the FlattenedState tuple, the serialize/deserialize constructors,
   and `__obj_flatten__`. Standard TRT builds keep `SERIALIZATION_LEN == 11`
   so engines serialized there do not carry RTX-only slots.
 - Python `_TorchTensorRTModule` reads the RTX-only index accessors
   and writes the RTX-only engine-info slots only when
   `ENABLED_FEATURES.tensorrt_rtx` is true. Standard TRT users see
   no new behavior at runtime.
 - Deduplicated `_compiler.py` arguments after rebase on upstream
   main where PR pytorch#4184 had already added
   `dynamic_shapes_kernel_specialization_strategy`. Kept one copy of
   each arg; `cuda_graph_strategy` is threaded through all three
   compile() entry points.

Build + tests
 - RTX build on A100 / L40S: libtorchtrt.so and libtorchtrt_runtime.so link
   clean, no `#ifdef` diagnostics. Pre-commit checks
   pass (clang-format, black, isort, ruff, mypy, typos, buildifier).
 - All 35 runtime-cache/strategy tests pass; regression across
   test_000_runtime_cache.py (Python runtime), test_002_cudagraphs_cpp.py,
   test_005_dynamic_allocation.py is green.

Addresses review comments on PR pytorch#4202:
 - Guarding of new IDX entries and Python accessors on
   TRT_MAJOR_RTX / ENABLED_FEATURES.tensorrt_rtx.
 - Encapsulation of RTX-specific state in a dedicated type with
   enumerated strategies and transparent standard-TRT/RTX behavior.
 - Destructor exception safety.
 - Unification of the execution-context creation path via
   IRuntimeConfig.
 - Removal of file locking for runtime-cache persistence.
 - Debug asserts before dereferencing the live IRuntimeConfig.
 - Human-readable to_str output.
 - save_runtime_cache invoked from disable_rtx_native_cudagraphs.
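
The commit above persists the runtime cache with an atomic tmp+rename. That pattern is language-agnostic; a minimal Python sketch of it (hypothetical function, not the actual C++ save_runtime_cache implementation) looks like this:

```python
import os
import tempfile


def save_cache_atomically(path: str, payload: bytes) -> None:
    # Write to a temporary file in the same directory, then rename it over the
    # destination. os.replace is atomic on POSIX and Windows, so readers never
    # observe a partially written cache file.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".cache.tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```
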
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request Apr 22, 2026
Address PR review comments that asked the new C++ runtime tests be
folded into existing feature-level files rather than shipped as
parallel `*_cpp.py` files.

What
 - Merge `test_000_runtime_cache_cpp.py` into the existing
   `test_000_runtime_cache.py`. The file already covered the Python
   runtime path; two new classes (`TestRuntimeCacheCppPersistence`,
   `TestCppSerializationIndices`) cover the C++ runtime path via
   `use_python_runtime=False`, and the serialization-index
   assertions. Skip on non-RTX builds.
 - Fold the C++ runtime cases for dynamic shapes kernel
   specialization strategy into `test_001_dynamic_shapes_kernel_strategy.py`
   (introduced upstream in PR pytorch#4184). Two new classes
   (`TestDynamicShapesKernelStrategyCpp`,
   `TestDynamicShapesKernelStrategyCppInvalidValue`) exercise lazy/eager/none end-to-end and
   reject invalid strategy names. The pre-existing Python runtime
   tests remain untouched.
 - Rename `test_000_cuda_graph_strategy.py` to `test_001_cuda_graph_strategy.py`
   to match the `test_001_*` convention used for L1
   RTX-only features. When upstream lands the Python runtime
   counterpart (PR pytorch#4187), both sets fold into the same file.
 - Add model-level tests: `test_runtime_cache_models.py` gains a
   `TestRuntimeCacheCppModels` class exercising ResNet18 through the
   C++ runtime with warm-cache roundtrip.
   `test_dynamic_shapes_kernel_strategy_models.py` gains
   `TestDynamicShapesKernelStrategyCppModels` covering lazy/eager/none on
   ResNet18 via the C++ runtime.

Verified
 - 35 passed / 3 skipped in the runtime/ tests (merged file plus
   test_001 strategy files).
 - No regression in test_002_cudagraphs_cpp.py (8 passed) or
   test_005_dynamic_allocation.py (1 passed).

Addresses PR pytorch#4202 review comments asking for test file merges and
the addition of model-level runtime_cache_models.py /
dynamic_shapes_kernel_strategy_models.py coverage.
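
For orientation, a rough sketch of the shape such a merged strategy test can take (class name, model, and shapes are hypothetical, not the actual contents of test_001_dynamic_shapes_kernel_strategy.py):

```python
import unittest

import torch
import torch_tensorrt as torchtrt
from torch_tensorrt._features import ENABLED_FEATURES


@unittest.skipIf(not ENABLED_FEATURES.tensorrt_rtx, "TensorRT-RTX only feature")
class TestDynamicShapesKernelStrategySketch(unittest.TestCase):
    def _compile(self, strategy):
        model = torch.nn.Linear(64, 32).eval().cuda()
        inputs = [
            torchtrt.Input(min_shape=(1, 64), opt_shape=(8, 64), max_shape=(32, 64))
        ]
        return torchtrt.compile(
            model,
            ir="dynamo",
            inputs=inputs,
            use_python_runtime=False,  # exercise the C++ runtime path
            dynamic_shapes_kernel_specialization_strategy=strategy,
        )

    def test_strategies_end_to_end(self):
        # "lazy", "eager", and "none" should all produce a working module.
        for strategy in ("lazy", "eager", "none"):
            trt_mod = self._compile(strategy)
            out = trt_mod(torch.randn(4, 64, device="cuda"))
            self.assertEqual(tuple(out.shape), (4, 32))

    def test_invalid_strategy_rejected(self):
        with self.assertRaises(Exception):
            self._compile("not-a-strategy")
```
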
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request Apr 22, 2026
Follow-up to 54f9ccd / 1fa8c82 addressing the second batch of PR
pytorch#4202 review feedback. Pure refactor with no user-visible behavior
change; all tests green on A100 (35 passed / 3 skipped + 9 regression
passed).

TRTEngine
 - Constructor signature simplified: three separate `runtime_cache_path`
   / `dynamic_shapes_kernel_strategy` / `cuda_graph_strategy` parameters
   collapsed into a single `TRTRuntimeConfig runtime_cfg` sink parameter.
   The forwarding ctor std::moves it into the primary ctor, which
   std::moves it into the member.
 - String sink parameters (mod_name, serialized_engine, serialized_metadata)
   taken by value and moved into members / slugify.
 - Deserialization constructor routes through the new free function
   make_runtime_config_from_serialized, which internalizes the
   TRT_MAJOR_RTX-gated index reads so the constructor itself stays
   unguarded.
 - FlattenedState uses a single TRTRTX_FLATTENED_STATE_EXTRAS macro for
   the three RTX-only tuple entries instead of duplicating the first
   eleven entries across two branches.
 - Destructor restored to the pre-refactor structure: torch::cuda::synchronize
   runs outside a try block and runtime_cfg.save_runtime_cache (now noexcept
   by signature) is called directly. Exception
   safety is guaranteed by the member's type, not by a defensive
   try/catch.
 - __obj_flatten__ and serialize cast enum values via
   std::underlying_type_t<...> instead of int so serialization stays
   in lockstep with any future underlying-type change on the enums.

TRTRuntimeConfig
 - Conversion helpers take std::underlying_type_t<Enum> (the declared
   32-bit integer type) instead of raw int. Callers at serialization
   boundaries explicitly std::stoi / static_cast into the right type.
 - [[nodiscard]] added to to_string, to_dynamic_shapes_kernel_strategy,
   to_cuda_graph_strategy_option, uses_internal_capture,
   is_monolithic_capturable, to_str, and make_runtime_config_from_serialized.
 - to_string default cases now TORCHTRT_CHECK(false, ...) with the
   unexpected integer value; std::unreachable is C++23.
 - set_execution_context_allocation_strategy is now const.
 - Cache I/O split into two layers:
     - Free functions load_runtime_cache(path, cache) and
       save_runtime_cache(path, cache) perform the raw std::filesystem
       I/O and use TORCHTRT_CHECK on failure -- exception-propagating,
       easier to test in isolation.
     - Member TRTRuntimeConfig::save_runtime_cache() is a noexcept
       wrapper that calls the free function and swallows exceptions via
       try/catch -- safe from a destructor.
   The _nothrow suffix is dropped from the member name (the signature
   now carries that contract).
 - write_to_str(ostream&) replaced by two functions: a const-correct
   to_str() -> std::string, and a free operator<<(ostream&, const
   TRTRuntimeConfig&) that wraps it with "Runtime cfg { ... }"
   delimiters. TRTEngine::to_str streams the config via the free
   operator.

Python
 - _settings.py: removed a duplicated
   dynamic_shapes_kernel_specialization_strategy field and its duplicated docstring left
   over from the upstream rebase of PR pytorch#4184 into our changes.

Covers review comments 3126538200, 3126541782, 3126547529, 3126549147,
3126682329, 3126683329, 3126693226, 3126715369, 3126725953, 3126736626,
3126738422, 3126745230, 3126747553, 3126749405, 3126764831, 3126772536,
3126786564, 3126803652, 3126816780, 3126818065, 3126818561, 3126819429,
3126823781, 3126840987, 3126846827.

Labels

backend: TensorRT-RTX, cla signed, component: api [Python], component: build system, component: conversion, component: core, component: dynamo, component: runtime, component: tests, documentation
